READ - I've created a Python script that lets the user click a toggle button and then either view all of the underlying code OR just look at the raw output (charts, plots, whatever).
As you know, sometimes these notebooks contain a fair amount of code... and sometimes folks just want the results... here is an example

The notebook defaults to NOT showing any code, so click the toggle button to view the underlying code...


Unsupervised Machine Learning Approach

Methodology: DBSCAN Iteration and Tuning

Univariate Temperature Data


Approach: https://scikit-learn.org/stable/modules/neighbors.html

Nearest Neighbors !

Import standard Python libraries

Change directories to read in the file

Read in data (parquet file saved and exported from the previous notebook)

Convert temperature variable from C to F
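As a minimal sketch of that conversion step (the dataframe and column names here are illustrative, not the notebook's actual names):

```python
import pandas as pd

# Hypothetical column names -- the real notebook's dataframe may differ.
df = pd.DataFrame({"temp_c": [0.0, 100.0, -40.0]})

# Celsius to Fahrenheit: F = C * 9/5 + 32
df["temp_f"] = df["temp_c"] * 9 / 5 + 32
```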


Plotting our raw temperature data to see what it looks like

Converting raw data via StandardScaler()
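A sketch of that scaling step, using toy values in place of the raw temperature series:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy univariate data standing in for the raw temperatures.
raw = np.array([[50.0], [60.0], [70.0]])

# StandardScaler centers to zero mean and scales to unit variance,
# which matters for DBSCAN since eps is a distance threshold.
X_scaled = StandardScaler().fit_transform(raw)
```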

STOP and think:


Assimilated Observations:

  1. As the min_samples value increases, the number of Cluster -1 anomalies generally appears to increase, but not dramatically

  2. An eps value of 0.01 appears (in my estimation) to be too HIGH a value !

  3. The larger the eps value gets, the LONGER it takes to process mathematically and thus compute-wise !

  4. Definitions:
     - eps: Two points are considered neighbors if the distance between the two points is below the threshold epsilon.
     - min_samples: The minimum number of neighbors a given point should have in order to be classified as a core point. It's important to note that the point itself is included in the minimum number of samples.
     - metric: The metric to use when calculating distance between instances in a feature array (e.g., Euclidean distance).

  5. As the number of min_samples increases, in theory the number of clusters found should decrease...

  6. As the number of min_samples (for a set eps value) increases, the NUMBER of cluster -1 matches will increase...
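To make the definitions above concrete, here is a tiny self-contained illustration (toy 1-D data, not the notebook's temperature set): two tight groups plus one isolated point, where the isolated point ends up labeled -1 (noise).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point.
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [20.0]])

# min_samples=3 counts the point itself; eps=0.5 is the neighbor threshold.
db = DBSCAN(eps=0.5, min_samples=3, metric="euclidean").fit(pts)

# db.labels_ -> two clusters (0 and 1) plus -1 for the lone outlier
```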

-- DBSCAN Gridsearch 1 -- Running DBSCAN on the raw data and varying eps/min_samples hyperparameters to determine outcome

I previously ran this...

import sys
import numpy as np
from sklearn.cluster import DBSCAN

# THIS IS REALLY CHECKING A HUGE NUMBER OF VARIATIONS...

eps_range = [.02]

# , .02, .03, .04, .05, .06, .07, .08, 
#              .09, .10, .11, .12, .13, .14, .15, .16, 
#              .17, .18, .19, .20]

min_samples_range = [3, 4, 5, 6, 7, 8, 9, 10, 11, 
                     12, 13, 14, 15, 16, 17, 18, 
                     19, 20, 21, 22, 23, 24, 25]

# Redirect stdout to a file (raw string avoids backslash-escape issues)
sys.stdout = open(r'D:\dbscan_gridsearch_02.txt', 'wt')

print('Tom Bresee:')
print('')
print('eps range:')
print(eps_range)
print('')
print('min_samples range:')
print(min_samples_range)
print('')

for i in eps_range:
    for j in min_samples_range:
        print('--- NEW VARIABLE ---')
        db = DBSCAN(eps=i, min_samples=j, metric='euclidean', n_jobs=-1).fit(X)
        label = db.labels_
        sample_cores = np.zeros_like(label, dtype=bool)
        sample_cores[db.core_sample_indices_] = True
        # Cluster count, excluding the -1 noise label
        n_clusters = len(set(label)) - (1 if -1 in label else 0)
        print('eps:', i, "  min_samples:", j)
        print('Cluster Count:', n_clusters)
        tsys["cluster"] = db.labels_
        print(tsys["cluster"].value_counts(), '\n')
        print("")

Output:
Tom Bresee:

eps range:
[0.01]

min_samples range:
[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]


--- NEW VARIABLE ---
eps: 0.01   min_samples: 3
Cluster Count: 3
0    203424
1        54
2         3



--- NEW VARIABLE ---
eps: 0.01   min_samples: 4
Cluster Count: 2
 0    203424
 1        54
-1         3



--- NEW VARIABLE ---
eps: 0.01   min_samples: 5
Cluster Count: 2
 0    203424
 1        54
-1         3



--- NEW VARIABLE ---
eps: 0.01   min_samples: 6
Cluster Count: 2
 0    203424
 1        54
-1         3



--- NEW VARIABLE ---
eps: 0.01   min_samples: 7
Cluster Count: 2
 0    203423
 1        54
-1         4



--- NEW VARIABLE ---
eps: 0.01   min_samples: 8
Cluster Count: 3
 0    203423
 1        47
 2         6
-1         5



--- NEW VARIABLE ---
eps: 0.01   min_samples: 9
Cluster Count: 4
 0    203354
 1        69
 2        39
-1        13
 3         6



--- NEW VARIABLE ---
eps: 0.01   min_samples: 10
Cluster Count: 3
 0    203354
 1        69
 2        31
-1        27



--- NEW VARIABLE ---
eps: 0.01   min_samples: 11
Cluster Count: 3
 0    203354
 1        69
-1        34
 2        24



--- NEW VARIABLE ---
eps: 0.01   min_samples: 12
Cluster Count: 3
 0    203354
 1        69
-1        34
 2        24



--- NEW VARIABLE ---
eps: 0.01   min_samples: 13
Cluster Count: 5
 0    198366
 4      4988
 1        69
-1        34
 2        13
 3        11



--- NEW VARIABLE ---
eps: 0.01   min_samples: 14
Cluster Count: 3
 0    198366
 2      4988
 1        69
-1        58



--- NEW VARIABLE ---
eps: 0.01   min_samples: 15
Cluster Count: 3
 0    198366
 2      4988
-1        64
 1        63



--- NEW VARIABLE ---
eps: 0.01   min_samples: 16
Cluster Count: 4
 0    198365
 3      4988
-1        73
 1        34
 2        21



--- NEW VARIABLE ---
eps: 0.01   min_samples: 17
Cluster Count: 4
 0    198362
 3      4991
-1        73
 1        31
 2        24



--- NEW VARIABLE ---
eps: 0.01   min_samples: 18
Cluster Count: 4
 0    198361
 3      4990
-1        80
 1        27
 2        23



--- NEW VARIABLE ---
eps: 0.01   min_samples: 19
Cluster Count: 4
 0    198361
 3      4988
-1        85
 1        26
 2        21



--- NEW VARIABLE ---
eps: 0.01   min_samples: 20
Cluster Count: 3
 0    198361
 2      4988
-1       107
 1        25



--- NEW VARIABLE ---
eps: 0.01   min_samples: 21
Cluster Count: 3
 0    198358
 2      4986
-1       116
 1        21



--- NEW VARIABLE ---
eps: 0.01   min_samples: 22
Cluster Count: 2
 0    198357
 1      4986
-1       138



--- NEW VARIABLE ---
eps: 0.01   min_samples: 23
Cluster Count: 2
 0    198357
 1      4986
-1       138



--- NEW VARIABLE ---
eps: 0.01   min_samples: 24
Cluster Count: 2
 0    198357
 1      4986
-1       138



--- NEW VARIABLE ---
eps: 0.01   min_samples: 25
Cluster Count: 4
 0    198355
 3      4943
-1       141
 1        30
 2        12

Tom Bresee:

eps range:
[0.02]    <---- this eps value is way too big, I believe !

min_samples range:
[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]

--- NEW VARIABLE ---
eps: 0.02   min_samples: 3
Cluster Count: 1
0    203481
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 4
Cluster Count: 1
0    203481
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 5
Cluster Count: 1
0    203481
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 6
Cluster Count: 1
0    203481
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 7
Cluster Count: 1
0    203481
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 8
Cluster Count: 1
 0    203480
-1         1
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 9
Cluster Count: 1
 0    203480
-1         1
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 10
Cluster Count: 1
 0    203480
-1         1
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 11
Cluster Count: 1
 0    203479
-1         2
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 12
Cluster Count: 1
 0    203479
-1         2
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 13
Cluster Count: 2
 0    203425
 1        53
-1         3
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 14
Cluster Count: 2
 0    203425
 1        53
-1         3
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 15
Cluster Count: 2
 0    203425
 1        53
-1         3
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 16
Cluster Count: 2
 0    203425
 1        53
-1         3
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 17
Cluster Count: 2
 0    203425
 1        52
-1         4
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 18
Cluster Count: 2
 0    203424
 1        47
-1        10
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 19
Cluster Count: 2
 0    203424
 1        44
-1        13
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 20
Cluster Count: 2
 0    203424
 1        42
-1        15
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 21
Cluster Count: 2
 0    203424
 1        42
-1        15
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 22
Cluster Count: 2
 0    203424
 1        33
-1        24
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 23
Cluster Count: 2
 0    203424
 1        33
-1        24
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 24
Cluster Count: 2
 0    203424
 1        30
-1        27
Name: cluster, dtype: int64 


--- NEW VARIABLE ---
eps: 0.02   min_samples: 25
Cluster Count: 2
 0    203424
-1        31
 1        26
Name: cluster, dtype: int64

Tuning

ASSUME: min_samples: 40

After I selected my MinPts value, I used NearestNeighbors from scikit-learn (documentation linked above) to calculate the average distance between each point and its n_neighbors. The one parameter you need to define is n_neighbors, which in this case is the value you chose for MinPts.

Calculate the average distance between each point in the data set and its 40 nearest neighbors (my selected MinPts value)


ASSUME: min_samples: 50

ASSUME: min_samples: 100

ASSUME: min_samples: 10

We can calculate the distance from each point to its closest neighbors using NearestNeighbors.

The point itself is included in n_neighbors.

The kneighbors method returns two arrays: one containing the distances to the closest n_neighbors points, and the other containing the indices of those points.
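The steps above can be sketched as follows (toy 1-D data standing in for the scaled temperature array; variable names are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data in place of the scaled temperature values.
X = np.array([[0.0], [0.5], [1.0], [1.5], [10.0]])

k = 3  # n_neighbors -- set this to your chosen MinPts value
nn = NearestNeighbors(n_neighbors=k).fit(X)

# Each row of `distances` covers the point itself (distance 0,
# since X is also the training set) plus its k-1 nearest neighbors.
distances, indices = nn.kneighbors(X)

# Sorting the mean k-distance per point gives the curve you plot
# to look for the "elbow" that suggests an eps value.
k_dist = np.sort(distances.mean(axis=1))
```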

IMPORTANT: PLOTTING FOR CLARITY


Observation: It doesn't appear to matter how many min_samples you use !


Analysis: Choosing the right eps value based on these above plots !


An eps value of 0.003 appears to be a pretty good optimal hyperparameter to utilize, according to this investigation...


Redoing the analysis of clusters with our determined optimal eps value...

Output report: https://raw.githubusercontent.com/tombresee/SensorAnalysis/main/ENTER/results/FINAL_dbscan_gridsearch_export_correct.txt I think this helped a lot in triangulating things...


We will find this returns results close to our original 201 notebook.

-- ITERATION 1 -- Plotting the clusters that DBSCAN identified, with our base eps and min_samples hyperparameters:

-- ITERATION 1 -- Translating / Understanding what we are seeing with anomaly detection:

-- ITERATION 1 -- Visualizing some closeup ranges (red anomaly and blue anomaly regions):

As a reference, we use plotly to show all the temperature values over the entire range, so readers can home in on the window they wish to view/explore...

https://ghcdn.rawgit.org/tombresee/SensorAnalysis/main/ENTER/results/temp_1q_2019_chicago.html

-- ITERATION 1 -- Outputting and exporting the actual anomalous values (ranges), and plotting shaded regions...



ITERATE 1:

https://en.wikipedia.org/wiki/January%E2%80%93February_2019_North_American_cold_wave

https://weatherspark.com/m/14091/2/Average-Weather-in-February-in-Chicago-Illinois-United-States

Five records were set for Chicago for the month of January 2019:

RESULTS:

db = DBSCAN(eps=0.01, min_samples=30, metric='euclidean', n_jobs=-1).fit(X)
 0    139272
 3      4284
 4       664
 2       403
-1       126
 1       104


BELOW SHOWN IN PLOT:

db = DBSCAN(eps=0.01, min_samples=20, metric='euclidean', n_jobs=-1).fit(X)
 0    139842
 1      4988
-1        23
NOT BAD !

Calculating Metrics:
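The notebook does not show which metrics were used here; one common hedge for evaluating a DBSCAN result without ground-truth labels is the silhouette score, computed on the non-noise points only (a sketch with toy data, not the notebook's actual metric code):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Toy data; in the notebook this would be the scaled temperature array X.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [20.0]])
labels = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

# Silhouette is undefined for noise points, so score only clustered ones.
mask = labels != -1
score = silhouette_score(X[mask], labels[mask])
```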

Using Tuning Approach:

Appendix: Do Not Delete

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

clustering1 = DBSCAN(eps=0.09, min_samples=6).fit(
    np.array(ts_dataframe['Normalized Profit']).reshape(-1, 1))

labels = clustering1.labels_

# Positions of the points DBSCAN labeled as noise (-1)
outlier_pos = np.where(labels == -1)[0]

x, y = [], []
for pos in outlier_pos:
    x.append(np.array(ts_dataframe['Normalized Profit'])[pos])
    y.append(ts_dataframe['Normalized Profit'].index[pos])

plt.plot(ts_dataframe['Normalized Profit'], 'k-')
plt.plot(y, x, 'r*', markersize=8)
plt.legend(['Actual', 'Anomaly Detected'])
plt.xlabel('Time Period')
plt.xticks([0, 20, 40, 60, 80, 99],
           [ts_dataframe.index[0], ts_dataframe.index[20],
            ts_dataframe.index[40], ts_dataframe.index[60],
            ts_dataframe.index[80], ts_dataframe.index[99]],
           rotation=45)
plt.ylabel('Normalized Profit')


RESULTS:
========    

db = DBSCAN(eps=0.03, min_samples=6, metric='euclidean', n_jobs=-1)
 0    139811
 3      4950
 2        37
 1        26
-1        19
 4        10


db = DBSCAN(eps=0.04, min_samples=5, metric='euclidean', n_jobs=-1)
 0    144840
-1        13


db = DBSCAN(eps=0.04, min_samples=7, metric='euclidean', n_jobs=-1)
 0    139848
 1      4989
-1        16


db = DBSCAN(eps=0.04, min_samples=7, metric='euclidean', n_jobs=-1)
 0    139848
 1      4989
-1        16


db = DBSCAN(eps=0.1, min_samples=10, metric='euclidean', n_jobs=-1)
 0    198364
 4      4990
 1        66
-1        41
 2        10
 3        10





Thus

  1. Lowering the eps value causes fewer and fewer anomalies to be detected
  2. Increasing the eps value then...